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Abstract 


We study a number of open issues in spectral clustering: (i) Selecting the 
appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Cluster- 
ing with irregular background clutter, and, (iv) Finding automatically the 
number of groups. We first propose that a ‘local’ scale should be used to 
compute the affinity between each pair of points. This local scaling leads 
to better clustering especially when the data includes multiple scales and 
when the clusters are placed within a cluttered background. We further 
suggest exploiting the structure of the eigenvectors to infer automatically 
the number of groups. This leads to a new algorithm in which the final 
randomly initialized k-means stage is eliminated. 


1 Introduction 


Clustering is one of the building blocks of modern data analysis. Two commonly used 
methods are K-means and learning a mixture-model using EM. These methods, which are 
based on estimating explicit models of the data, provide high quality results when the data 
is organized according to the assumed models. However, when it is arranged in more com- 
plex and unknown shapes, these methods tend to fail. An alternative clustering approach, 
which was shown to handle such structured data is spectral clustering. It does not require 
estimating an explicit model of data distribution, rather a spectral analysis of the matrix 
of point-to-point similarities. A first set of papers suggested the method based on a set of 
heuristics (e.g., [8, 9]). A second generation provided a level of theoretical analysis, and 
suggested improved algorithms (e.g., [6, 10, 5, 4, 3]). 


There are still open issues: (i) Selection of the appropriate scale in which the data is to 
be analyzed, (i) Clustering data that is distributed according to different scales, (iii) Clus- 
tering with irregular background clutter, and, (iv) Estimating automatically the number of 
groups. We show here that it is possible to address these issues and propose ideas to tune 
the parameters automatically according to the data. 


1.1 Notation and the Ng-Jordan-Weiss (NJW) Algorithm 
The analysis and approaches suggested in this paper build on observations presented in [5]. 
For completeness of the text we first briefly review their algorithm. 


Given a set of n points S = {51,..., Sn } in R? cluster them into C clusters as follows: 


1. Form the affinity matrix A € R"*” defined by A;; = exp ( =a (si83)) fori Æ j 
and A;; = 0, where d(s;, sj) is some distance function, often just the Euclidean 
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Figure 1: Spectral clustering without local scaling (using the NJW algorithm.) Top row: 
When the data incorporates multiple scales standard spectral clustering fails. Note, that 
the optimal o for each example (displayed on each figure) turned out to be different. Bottom 
row: Clustering results for the top-left point-set with different values of o. This highlights 
the high impact o has on the clustering quality. In all the examples, the number of groups 
was set manually. The data points were normalized to occupy the |—1, 1] space. 


distance between the vectors s; and sj. ø is a scale parameter which is further 
discussed in Section 2. 


2. Define D to be a diagonal matrix with Di; = Dae Aij and construct the nor- 
malized affinity matrix L = D~'/?AD~\/?. 
3. Manually select a desired number of groups C. 


4. Find z1,..., £c, the C largest eigenvectors of L, and form the matrix X = 
[z1,-.-,%0] ER". 


5. Re-normalize the rows of X to have unit length yielding Y € R”*°, such that 
Yi; = Xy / Z; Sa) 
6. Treat each row of Y as a point in R© and cluster via k-means. 


7. Assign the original point s; to cluster c if and only if the corresponding row 7 of 
the matrix Y was assigned to cluster c. 


In Section 2 we analyze the effect of o on the clustering and suggest a method for setting 
it automatically. We show that this allows handling multi-scale data and background clut- 
ter. In Section 3 we suggest a scheme for finding automatically the number of groups C. 
Our new spectral clustering algorithm is summarized in Section 4. We conclude with a 
discussion in Section 5. 


2 Local Scaling 


As was suggested by [6] the scaling parameter is some measure of when two points are 
considered similar. This provides an intuitive way for selecting possible values for ø. The 
selection of ø is commonly done manually. Ng et al. [5] suggested selecting o automat- 
ically by running their clustering algorithm repeatedly for a number of values of o and 
selecting the one which provides least distorted clusters of the rows of Y. This increases 
significantly the computation time. Additionally, the range of values to be tested still has to 
be set manually. Moreover, when the input data includes clusters with different local statis- 
tics there may not be a singe value of ø that works well for all the data. Figure 1 illustrates 
the high impact ø has on clustering. When the data contains multiple scales, even using the 
optimal o fails to provide good clustering (see examples at the right of top row). 


(a) (b) 


Figure 2: The effect of local scaling. (a) Input data points. A tight cluster resides within 
a background cluster. (b) The affinity between each point and its surrounding neighbors 
is indicated by the thickness of the line connecting them. The affinities across clusters are 
larger than the affinities within the background cluster. (c) The corresponding visualization 
of affinities after local scaling. The affinities across clusters are now significantly lower 
than the affinities within any single cluster. 


Introducing Local Scaling: Instead of selecting a single scaling parameter o we propose 
to calculate a local scaling parameter o; for each data point s;. The distance from s; 
to sj as ‘seen’ by s; is d(s;,s;)/o; while the converse is d(s;,s;)/o;. Therefore the 
square distance d? of the earlier papers may be generalized as d(s;,8;)d(s;, 8;)/oi0; = 
d*(s;,8;)/oi0; The affinity between a pair of points can thus be written as: 


Ble: x 
Ajj = exp (a) (1) 


O70; 


Using a specific scaling parameter for each point allows self-tuning of the point-to-point 
distances according to the local statistics of the neighborhoods surrounding points 7 and j. 


The selection of the local scale g; can be done by studying the local statistics of the neigh- 
borhood of point s;. A simple choice, which is used for the experiments in this paper, 
is: 

oi = d(si, sx) (2) 


where sg is the K’th neighbor of point s;. The selection of K is independent of scale 
and is a function of the data dimension of the embedding space. Nevertheless, in all our 
experiments (both on synthetic data and on images) we used a single value of K = 7, which 
gave good results even for high-dimensional data (the experiments with high-dimensional 
data were left out due to lack of space). 


Figure 2 provides a visualization of the effect of the suggested local scaling. Since the data 
resides in multiple scales (one cluster is tight and the other is sparse) the standard approach 
to estimating affinities fails to capture the data structure (see Figure 2.b). Local scaling 
automatically finds the two scales and results in high affinities within clusters and low 
affinities across clusters (see Figure 2.c). This is the information required for separation. 


We tested the power of local scaling by clustering the data set of Figure 1, plus four ad- 
ditional examples. We modified the Ng-Jordan-Weiss algorithm reviewed in Section 1.1 
substituting the locally scaled affinity matrix A (of Eq. (1)) instead of A. Results are shown 
in Figure 3. In spite of the multiple scales and the various types of structure, the groups 
now match the intuitive solution. 


3 Estimating the Number of Clusters 


Having defined a scheme to set the scale parameter automatically we are left with one 
more free parameter: the number of clusters. This parameter is usually set manually and 
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Figure 3: Our clustering results. Using the algorithm summarized in Section 4. The 
number of groups was found automatically. 
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Figure 4: Eigenvalues. The first 10 eigenvalues of L corresponding to the top row data 
sets of Figure 3. 


not much research has been done as to how might one set it automatically. In this section 
we suggest an approach to discovering the number of clusters. The suggested scheme turns 
out to lead to a new spatial clustering algorithm. 


3.1 The Intuitive Solution: Analyzing the Eigenvalues 


One possible approach to try and discover the number of groups is to analyze the eigenval- 
ues of the affinity matrix. The analysis given in [5] shows that the first (highest magnitude) 
eigenvalue of L (see Section 1.1) will be a repeated eigenvalue of magnitude 1 with mul- 
tiplicity equal to the number of groups C. This implies one could estimate C by counting 
the number of eigenvalues equaling 1. 


Examining the eigenvalues of our locally scaled matrix, corresponding to clean data-sets, 
indeed shows that the multiplicity of eigenvalue 1 equals the number of groups. However, 
if the groups are not clearly separated, once noise is introduced, the values start to deviate 
from 1, thus the criterion of choice becomes tricky. An alternative approach would be to 
search for a drop in the magnitude of the eigenvalues (this was pursued to some extent by 
Polito and Perona in [7]). This approach, however, lacks a theoretical justification. The 
eigenvalues of L are the union of the eigenvalues of the sub-matrices corresponding to 
each cluster. This implies the eigenvalues depend on the structure of the individual clusters 
and thus no assumptions can be placed on their values. In particular, the gap between the 
C’ th eigenvalue and the next one can be either small or large. Figure 4 shows the first 10 
eigenvalues corresponding to the top row examples of Figure 3. It highlights the different 
patterns of distribution of eigenvalues for different data sets. 


3.2 A Better Approach: Analyzing the Eigenvectors 


We thus suggest an alternative approach which relies on the structure of the eigenvec- 
tors. After sorting L according to clusters, in the “ideal” case (i.e., when L is strictly 
block diagonal with blocks L,c = 1,...,C), its eigenvalues and eigenvectors are 
the union of the eigenvalues and eigenvectors of its blocks padded appropriately with 
zeros (see [6, 5]). As long as the eigenvalues of the blocks are different each eigen- 


vector will have non-zero values only in entries corresponding to a single block/cluster: 
cD 0 T 
X= O° 5 T where x‘°) is an eigenvector of the sub-matrix L cor- 
=~ — 
0. 0 “al 
responding to cluster c. However, as was shown above, the eigenvalue | is bound to be a 
repeated eigenvalue with multiplicity equal to the number of groups C. Thus, the eigen- 
solver could just as easily have picked any other set of orthogonal vectors spanning the 


same subspace as X’s columns. That is, X could have been replaced by X = XR for any 
orthogonal matrix R € ROO., 


This, however, implies that even if the eigensolver provided us the rotated set of vectors, 
we are still guaranteed that there exists a rotation Ê such that each row in the matrix XR 
has a single non-zero entry. Since the eigenvectors of L are the union of the eigenvectors 
of its individual blocks (padded with zeros), taking more than the first C eigenvectors will 
result in more than one non-zero entry in some of the rows. Taking fewer eigenvectors we 
do not have a full basis spanning the subspace, thus depending on the initial X there might 
or might not exist such a rotation. Note, that these observations are independent of the 
difference in magnitude between the eigenvalues. 


We use these observations to predict the number of groups. For each possible group number 
C we recover the rotation which best aligns X’s columns with the canonical coordinate 
system. Let Z € R”*° be the matrix obtained after rotating the eigenvector matrix X, 
i.e., Z = X R and denote M; = max; Zij. We wish to recover the rotation R for which in 
every row in Z there will be at most one non-zero entry. We thus define a cost function: 


I= ap (3) 


Minimizing this cost function over all possible rotations will provide the best alignment 
with the canonical coordinate system. This is done using the gradient descent scheme 
described in Appendix A. The number of groups is taken as the one providing the minimal 
cost (if several group numbers yield practically the same minimal cost, the largest of those 
is selected). 


The search over the group number can be performed incrementally saving computation 
time. We start by aligning the top two eigenvectors (as well as possible). Then, at each 
step of the search (up to the maximal group number), we add a single eigenvector to the 
already rotated ones. This can be viewed as taking the alignment result of the previous 
group number as an initialization to the current one. The alignment of this new set of 
eigenvectors is extremely fast (typically a few iterations) since the initialization is good. 
The overall run time of this incremental procedure is just slightly longer than aligning all 
the eigenvectors in a non-incremental way. 


Using this scheme to estimate the number of groups on the data set of Figure 3 provided 
a correct result for all but one (for the right-most dataset at the bottom row we predicted 
2 clusters instead of 3). Corresponding plots of the alignment quality for different group 
numbers are shown in Figure 5. 


Yu and Shi [11] suggested rotating normalized eigenvectors to obtain an optimal segmen- 
tation. Their method iterates between non-maximum suppression (i.e., setting M; = 1 and 
Zij = 0 otherwise) and using SVD to recover the rotation which best aligns the columns of 
X with those of Z. In our experiments we noticed that this iterative method can easily get 
stuck in local minima and thus does not reliably find the optimal alignment and the group 
number. Another related approach is that suggested by Kannan et al. [3] who assigned 
points to clusters according to the maximal entry in the corresponding row of the eigenvec- 
tor matrix. This works well when there are no repeated eigenvalues as then the eigenvectors 
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Figure 5: Selecting Group Number. The alignment cost (of Eq. (3)) for varying group 
numbers corresponding to the top row data sets of Figure 3. The selected group number 
marked by a red circle, corresponds to the largest group number providing minimal cost 
(costs up to 0.01% apart were considered as same value). 


corresponding to different clusters are not intermixed. Kannan et al. used a non-normalized 
affinity matrix thus were not certain to obtain a repeated eigenvalue, however, this could 
easily happen and then the clustering would fail. 


4 A New Algorithm 


Our proposed method for estimating the number of groups automatically has two desir- 
able by-products: (i) After aligning with the canonical coordinate system, one can use 
non-maximum suppression on the rows of Z, thus eliminating the final iterative k-means 
process, which often requires around 100 iterations and depends highly on its initialization. 
(ii) Since the final clustering can be conducted by non-maximum suppression, we obtain 
clustering results for all the inspected group numbers at a tiny additional cost. When the 
data is highly noisy, one can still employ k-means, or better, EM, to cluster the rows of Z. 
However, since the data is now aligned with the canonical coordinate scheme we can obtain 
by non-maximum suppression an excellent initialization so very few iterations suffice. We 
summarize our suggested algorithm: 


Algorithm: Given a set of points S = {s1,...,8,}in R! that we want to cluster: 
1. Compute the local scale g; for each point s; € S using Eq. (2). 


2. Form the locally scaled affinity matrix Â € R”*” where Ai; is defined according 
to Eq. (1) fori Æ j and A; = 0. 

3. Define D to be a diagonal matrix with D;; = yi Ai and construct the nor- 
malized affinity matrix L = D~!/2AD~!/2, 

4. Find 21,...,2%¢ the C largest eigenvectors of L and form the matrix X = 
[£1,..., £a] E R”"*°, where C is the largest possible group number. 


5. Recover the rotation R which best aligns X’s columns with the canonical coordi- 
nate system using the incremental gradient descent scheme (see also Appendix A). 


6. Grade the cost of the alignment for each group number, up to C, according to 
Eq. (3). 

7. Set the final group number Cyest to be the largest group number with minimal 
alignment cost. 


8. Take the alignment result Z of the top Crest eigenvectors and assign the original 
point s; to cluster c if and only if max;(Z7,) = Z% 

9. If highly noisy data, use the previous step result to initialize k-means, or EM, 
clustering on the rows of Z. 


We tested the quality of this algorithm on real data. Figure 6 shows intensity based image 
segmentation results. The number of groups and the corresponding segmentation were 
obtained automatically. In this case same quality of results were obtained using non-scaled 
affinities, however, this required manual setting of both o (different values for different 
images) and the number of groups, whereas our result required no parameter settings. 


Figure 6: Automatic image segmentation. Fully automatic intensity based image segmen- 
tation results using our algorithm. 


More experiments and results on real data sets can be found on our web-page 
http://www. vision.caltech.edu/lihi/Demos/SelfTuningClustering.html 


5 Discussion & Conclusions 


Spectral clustering practitioners know that selecting good parameters to tune the cluster- 
ing process is an art requiring skill and patience. Automating spectral clustering was the 
main motivation for this study. The key ideas we introduced are three: (a) using a local 
scale, rather than a global one, (b) estimating the scale from the data, and (c) rotating the 
eigenvectors to create the maximally sparse representation. We proposed an automated 
spectral clustering algorithm based on these ideas: it computes automatically the scale and 
the number of groups and it can handle multi-scale data which are problematic for previous 
approaches. 


Some of the choices we made in our implementation were motivated by simplicity and are 
perfectible. For instance, the local scale o might be better estimated by a method which 
relies on more informative local statistics. Another example: the cost function in Eq. (3) is 
reasonable, but by no means the only possibility (e.g. the sum of the entropy of the rows 
Zi might be used instead). 
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A Recovering the Aligning Rotation 


To find the best alignment for a set of eigenvectors we adopt a gradient descent scheme 
similar to that suggested in [2]. There, Givens rotations where used to recover a rotation 
which diagonalizes a symmetric matrix by minimizing a cost function which measures the 
diagonality of the matrix. Similarly, here, we define a cost function which measures the 
alignment quality of a set of vectors and prove that the gradient descent, using Givens 
rotations, converges. 


The cost function we wish to minimize is that of Eq. (3). Let m; = j such that Zi; = 
Zim; = Mi. Note, that the indices m; of the maximal entries of the rows of X might be 
different than those of the optimal Z. A simple non-maximum supression on the rows of 
X can provide a wrong result. Using the gradient descent scheme allows to increase the 
cost corresponding to part of the rows as long as the overall cost is reduced, thus enabling 
changing the indices m;. 


Similar to [2] we wish to represent the rotation matrix R in terms of the smallest possible 
number of parameters. Let Gi, 3,9 denote a Givens rotation [1] of 0 radians (counterclock- 
wise) in the (7, j) coordinate plane. It is sufficient to consider Givens rotations so that i < j, 
thus we can use a convenient index re-mapping Gk, = Gi j0» where (i, 7) is the kth entry 
of a lexicographical list of (i, j) € {1,2,...,C}° pairs with i < j. Hence, finding the 
aligning rotation amounts to minimizing the cost function J over O € [—7/2,7/2)*. The 
update rule for O is: Ox41 = Ox — a VJ|g_o, where a € R is the step size. 


We next compute the gradient of J and bounds on a for stability. For convenience we will 
further adopt the notation convention of [2]. Let Uça, b) = Gao, Ga+1,0a41 `° © Gb,o, where 


Ula) = ee <a, Uk = Uʻ(k,k) and Vi = a9, Uk: Define A), 1 < k < K, element 


wise by AG = Gu . Since Z = XR we obtain A(*) = = XU ,k-1)VkU(k+1,K)- 
We can now cori ute YJ Suy ise: 
a = = 3 EE M? A M? obk 


Due to lack of space we cannot describe in full detail the complete convergence proof. We 
thus refer the reader to [2] where it is shown that convergence is obtained when 1 — af; 


ids ee 2 
lie in the unit circle, where Fy, = li] . Note, that at © = 0 we have Z;; = 0 
©=0 
; . OZim, : : 
for j Æ Mi, Zim, = Mi, and om = 3, = AD (i.e., near © = 0 the maximal 


: = a : Os 
entry for each row does not change its index). Deriving thus gives |e 


| ij|O=0 
De aD ia AA Further substituting in the values for a |o=o yields: 
| OJ ae ifk=l 
kl = =r = . 

06,00; j|@=0 0 otherwise 
where (ik, jk) is the pair (i, j) corresponding to the index k in the index re-mapping dis- 
cussed above. Hence, by setting a small enough we get that 1 — aF;,; lie in the unit circle 
and convergence is guaranteed. 


